ABSTRACT

In the past decade a number of fixed sampling methods have
been developed for selecting the "best" or at least a "good" subset
of variables in regression analysis. We are interested in
deriving a sequential selection procedure to select a subset of
random size that includes all good regression equations. Tables for
an example are given at the end of this paper.


1. INTRODUCTION 

In the past decade a number of fixed sampling methods have
been developed for selecting the "best" or at least a "good" subset
of variables in regression analysis (see e.g. Arvesen and
McCabe (1975) and Spjøtvoll (1972)). In this paper, we are
interested in deriving a sequential selection procedure to select a
subset of random size that includes all "good" regression equations.
Tables for an example are given at the end of this paper.

2. SEQUENTIAL SUBSET SELECTION PROCEDURE

Before discussing the regression problem, we develop general
results applicable to the selection of "good" or "superior"
populations, which are defined below.

Let $\pi_0, \pi_1, \ldots, \pi_k$ denote $k+1$ normal populations with
unknown means $\mu_0, \mu_1, \ldots, \mu_k$ and variances
$\sigma_0^2, \sigma_1^2, \ldots, \sigma_k^2$. Assume that $\sigma_0^2$
is known but $\sigma_i^2$ $(1 \le i \le k)$ are unknown. Let the ranked
values of $\sigma_i^2$ be denoted by
$\sigma_{[1]}^2 \le \cdots \le \sigma_{[k]}^2$. We wish to derive a
method to construct a sequential procedure to select a subset
containing all "superior" populations, that is, the populations with
smaller variances, with a probability not less than $P^*$
$(0 < P^* < 1)$, a specified constant. We assume that $\sigma_0^2 = 1$.

Let $X_{in}$ denote the $n$th observation from population $\pi_i$. It is
assumed that the observations $X_{i1}, \ldots, X_{in}$ are independent
random variables. Define

$$\bar{X}_{in} = n^{-1} \sum_{j=1}^{n} X_{ij},$$

$$s_{in}^2 = n^{-1} \sum_{j=1}^{n} (X_{ij} - \bar{X}_{in})^2.$$

The selection procedure will depend upon $\{s_{in}^2\}$, which is a
sufficient and transitive sequence and also invariantly sufficient for
$\sigma_i^2$.

A population $\pi_i$ is said to be "superior" (or "good") if
$\sigma_i^2 \le \Delta$, and to be "inferior" (or "bad") if
$\sigma_i^2 > \Delta$, where $\Delta$ is a specified constant greater
than 1. Let $\Omega$ be the parameter space, which is the collection of
all possible parameter vectors
$\theta = (\sigma_1^2, \ldots, \sigma_k^2)$. Let $t$ denote the unknown
number of inferior populations in the given collection of $k$
populations. We have $0 \le t \le k$. Let

$$\Omega_t = \{\theta : \sigma_{[1]}^2 \le \cdots \le \sigma_{[k-t]}^2 \le \Delta < \sigma_{[k-t+1]}^2 \le \cdots \le \sigma_{[k]}^2\}.$$

Then $\Omega = \bigcup_{t=0}^{k} \Omega_t$.

For the subset selection procedure $R$, two constants $\Delta$ and $P^*$
with $\Delta > 1$, $1 > P^* > 0$, are specified and we wish to select a
subset containing all superior populations with a probability of at
least $P^*$. When all the superior populations are contained in the
selected subset, we say a correct decision (CD) has been made.

Thus we require a procedure for which

$$P_\theta(CD \mid R) \ge P^*$$

for all $\theta \in \Omega$.

Let $g_{\sigma_i^2}(s_{in}^2)$ denote the probability density of
$s_{in}^2$ depending on the parameter $\sigma_i^2$. We define the
log-likelihood ratios

$$\ell_n(s_{in}^2) = \log g_\Delta(s_{in}^2) - \log g_1(s_{in}^2) \tag{2.1}$$

upon which the procedure is based.

Elimination type sequential selection procedure R for selecting
the superior populations.

Begin by taking $n_1$ $(\ge 1)$ independent observations from each
of the $k$ populations. Calculate the values of the $k$ log-likelihood
ratios $\ell_{n_1}(s_{in_1}^2)$, $1 \le i \le k$. For any $i$, if

$$\ell_{n_1}(s_{in_1}^2) \ge a,$$

where $a = \log\{k(k+1)/[2(1-P^*)]\}$, we eliminate the population
$\pi_i$ from further consideration. We proceed to the next (second)
stage by taking $n_2 - n_1$ independent observations on each of the
remaining populations. The log-likelihood ratios for the contending
populations are again computed and the same elimination rule is used,
except that $\ell_{n_2}(s_{in_2}^2)$ everywhere replaces
$\ell_{n_1}(s_{in_1}^2)$. We continue in this manner until the
elimination is stopped, at which time the procedure is terminated with
the declaration that the remaining populations are the superior
populations. If after applying this rule at the $s$th stage (say), the
number of remaining populations is zero, then we select the population
$\pi_0$, which is the control population.
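The elimination rule described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes that $s_{in}^2$ is the mean squared deviation with divisor $n$, so that $n s_{in}^2/\sigma_i^2$ is chi-square with $n-1$ degrees of freedom and the log-likelihood ratio (2.1) reduces to $\tfrac{1}{2} n s_{in}^2 (1 - 1/\Delta) - \tfrac{1}{2}(n-1)\log\Delta$. It also assumes the procedure stops at the first stage with no rejection. The names `log_lr`, `sample_var`, `eliminate` and the callback `draw` are hypothetical.

```python
import math


def log_lr(s2, n, delta):
    """log g_Delta(s2) - log g_1(s2), assuming n * s2 / sigma^2 ~ chi-square(n - 1)."""
    return 0.5 * n * s2 * (1.0 - 1.0 / delta) - 0.5 * (n - 1) * math.log(delta)


def sample_var(xs):
    """Mean squared deviation with divisor n (assumed definition of s_in^2)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n


def eliminate(draw, k, delta, p_star, stages):
    """Run the staged elimination rule.

    draw(i, m) is a hypothetical callback returning m new observations from
    population i (0-based); stages lists the cumulative sample sizes
    n_1 < n_2 < ... . Returns the indices of the retained populations;
    an empty list means the control population is selected.
    """
    a = math.log(k * (k + 1) / (2.0 * (1.0 - p_star)))
    data = {i: [] for i in range(k)}
    alive = list(range(k))
    for n in stages:
        for i in alive:
            # Top up each contending population to n observations.
            data[i].extend(draw(i, n - len(data[i])))
        rejected = {i for i in alive
                    if log_lr(sample_var(data[i]), n, delta) >= a}
        if not rejected:
            break  # elimination has stopped; the procedure terminates
        alive = [i for i in alive if i not in rejected]
    return alive
```

The returned list plays the role of the selected subset; how observations are produced is left entirely to the `draw` callback.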

Note that $n_1$ is the sample size of that stage of the procedure at
which a decision may be made, for the first time, to reject one or more
populations. Let $n_2 > n_1$ be the sample size of the next stage of
the procedure at which such a decision may be made, and in general let
$n_s > n_{s-1}$ be the sample size of the stage of the procedure at
which the $s$th decision to reject one or more populations may be made.
Let $N$ be the stage at which the procedure terminates. It is clear
that if there are $k$ populations to start with, then $N \le n_k$ (see
Gupta and Huang (1975)).

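We assume that

$$P_{\sigma_i^2}\{\ell_n(s_{in}^2) \ge a \text{ for some } n\} \tag{2.2}$$

is a nondecreasing function of $\sigma_i^2$. A sufficient condition for
this is discussed by Hoel (1970). Without loss of generality, we
assume that $\pi_1, \ldots, \pi_{k-t}$ are the superior populations.
Since the procedure $R$ is truncated, we have

$$
\begin{aligned}
1 - P_\theta(CD \mid R)
&= P_\theta\{\ell_n(s_{in}^2) \ge a \text{ for some } i = 1, \ldots, k-t, \text{ for some } t,\ 0 \le t \le k, \text{ for some } n\} \\
&\le \sum_{t=0}^{k} \sum_{i=1}^{k-t} P_{\sigma_i^2}\{\ell_n(s_{in}^2) \ge a \text{ for some } n\} \\
&\le \sum_{t=0}^{k} \sum_{i=1}^{k-t} P_{\Delta}\{\ell_n(s_{in}^2) \ge a \text{ for some } n\} \\
&\le \sum_{t=0}^{k} (k-t)\, e^{-a} = \tfrac{1}{2} k(k+1)\, e^{-a} = 1 - P^*.
\end{aligned}
$$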
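The last equality in the bound can be checked numerically. The sketch below uses a hypothetical helper `thresholds` (not part of the paper) that returns the constant $a$ together with $b = e^a$, taking $a = \log\{k(k+1)/[2(1-P^*)]\}$ as in the elimination rule. The values $k = 14$ and $P^* = 0.9$ are those used later in the example.

```python
import math


def thresholds(k, p_star):
    """Return (a, b) with b = k(k+1) / (2(1 - P*)) and a = log b."""
    b = k * (k + 1) / (2.0 * (1.0 - p_star))
    return math.log(b), b


k, p_star = 14, 0.9
a, b = thresholds(k, p_star)

# sum_{t=0}^{k} (k - t) equals k(k+1)/2, so the bound is (k(k+1)/2) * exp(-a).
count = sum(k - t for t in range(k + 1))
bound = count * math.exp(-a)

print(f"a = {a:.4f}, b = {b:.1f}, bound = {bound:.4f}, 1 - P* = {1 - p_star:.4f}")
```

Up to floating-point rounding, `bound` agrees with $1 - P^*$, which is the reason for this particular choice of $a$.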








3. APPLICATIONS TO SELECTION OF "GOOD"
OR "SUPERIOR" REGRESSION EQUATIONS


Assume the standard linear model

$$Y = X\beta + \varepsilon \tag{3.1}$$

where $X$ is an $n \times p$ known matrix of rank $p \le n$, $\beta$ is a
$p \times 1$ parameter vector, and
$\varepsilon \sim N(0, \sigma_0^2 I_n)$. Consider the models, for any
$r$, $2 \le r \le p-1$,

$$Y = X_{ri}\beta_{ri} + \varepsilon_{ri} \tag{3.2}$$

where $X_{ri}$ is an $n \times r$ matrix of rank $r$ whose first column
is $X_1 = [1, \ldots, 1]'$, $\beta_{ri}$ is an $r \times 1$ parameter
vector, and $\varepsilon_{ri} \sim N(0, \sigma_{ri}^2 I_n)$, where
$i = 1, \ldots, k_r = \binom{p-1}{r-1}$. Let
$k = \sum_{r=2}^{p-1} k_r$. The goal is to include all the designs
$X_{ri}$ (or sets of independent variables) associated with
$\sigma_{[j]}^2$, $j = 1, \ldots, k-t$.
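The count of candidate models can be illustrated with a short enumeration. The sketch assumes, as the tables at the end of the paper suggest, that every candidate design contains the constant column $X_1$ and $r - 1$ of the remaining $p - 1$ columns. The function name `candidate_subsets` is hypothetical.

```python
from itertools import combinations
from math import comb


def candidate_subsets(p):
    """Index sets of size r, 2 <= r <= p - 1, that contain column 1."""
    others = range(2, p + 1)
    subsets = []
    for r in range(2, p):
        for rest in combinations(others, r - 1):
            subsets.append((1,) + rest)
    return subsets


p = 5
subsets = candidate_subsets(p)
k_r = {r: comb(p - 1, r - 1) for r in range(2, p)}

print(k_r)            # number of designs of each size r
print(len(subsets))   # k, the total number of candidate designs
```

For $p = 5$ this gives $k = 4 + 6 + 4 = 14 = 2^{p-1} - 2$ candidate designs.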

Note that for any $r$, $2 \le r \le p-1$, if

$$SS_{ri} = Y'[I - X_{ri}(X_{ri}'X_{ri})^{-1}X_{ri}']Y = Y'Q_{ri}Y,$$

then following Searle (1972, p. 57)

$$SS_{ri}/\sigma_0^2 \sim \chi'^{2}\{\nu_r,\ (X\beta)'Q_{ri}(X\beta)/(2\sigma_0^2)\},$$

where $\nu_r = n - r$, for $1 \le i \le k_r$. Note that the
noncentrality parameter is not zero in general and that

$$\sigma_{ri}^2 = \sigma_0^2 + (X\beta)'Q_{ri}(X\beta)/\nu_r.$$

If $\sigma_0^2$ is not equal to 1, then we consider the linear model
$Y/\sigma_0 = X\beta/\sigma_0 + \varepsilon'$, with
$\varepsilon' \sim N(0, I_n)$. Thus we assume without loss of
generality that $\sigma_0^2 = 1$.

We know that the non-central $\chi'^{2}(\nu, \lambda)$ family, indexed
by the non-centrality parameter $\lambda$, has monotone likelihood
ratio. Hence the monotonicity of (2.2) is satisfied. We can apply the
sequential procedure $R$ to select superior regression equations by
replacing $s_{in}^2$ by $SS_{ri}/\nu_r$.
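The statistic $SS_{ri}/\nu_r$ that replaces $s_{in}^2$ can be computed as in the following sketch. It is an illustration only: it solves the normal equations by Gauss-Jordan elimination in plain Python, which is adequate for small, well-conditioned designs but is not the numerically preferred route. The function name `residual_mean_square` and its column-list input format are assumptions made for this example.

```python
def residual_mean_square(columns, y):
    """Return SS / (n - r) for the least-squares fit of y on the given columns.

    columns is a list of r lists, each of length n (the columns of the
    design matrix, constant column included); y is a list of length n.
    """
    n, r = len(y), len(columns)
    # Augmented matrix of the normal equations (X'X) beta = X'y.
    aug = [[sum(ci[t] * cj[t] for t in range(n)) for cj in columns]
           + [sum(ci[t] * y[t] for t in range(n))]
           for ci in columns]
    for col in range(r):
        # Partial pivoting, then clear the column in every other row.
        pivot = max(range(col, r), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(r):
            if row != col:
                f = aug[row][col] / aug[col][col]
                aug[row] = [x - f * p for x, p in zip(aug[row], aug[col])]
    beta = [aug[i][r] / aug[i][i] for i in range(r)]
    fitted = [sum(b * c[t] for b, c in zip(beta, columns)) for t in range(n)]
    ss = sum((y[t] - fitted[t]) ** 2 for t in range(n))  # residual sum of squares
    return ss / (n - r)
```

For instance, `residual_mean_square([[1, 1, 1], [0, 1, 2]], [1, 2, 4])` fits a straight line to three points and returns the residual sum of squares divided by $n - r = 1$.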










4. COMPUTATION OF (2.1)


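Let $U_{ri} = SS_{ri}/\nu_r$. The probability density of $U_{ri}$ is

$$g_{\sigma_{ri}^2}(U_{ri}) = \nu_r\, e^{-\lambda}\, e^{-\frac{1}{2}\nu_r U_{ri}} \sum_{k=0}^{\infty} \frac{\lambda^k (\nu_r U_{ri})^{\frac{1}{2}\nu_r + k - 1}}{k!\; 2^{\frac{1}{2}\nu_r + k}\; \Gamma(\frac{1}{2}\nu_r + k)},$$

where $\sigma_{ri}^2 = 1 + (X\beta)'Q_{ri}(X\beta)/\nu_r$,
$\nu_r = n - r$ and $\lambda = (X\beta)'Q_{ri}(X\beta)/2$.

If $\sigma_{ri}^2 = 1$, then $\lambda = 0$, and if
$\sigma_{ri}^2 = \Delta$, then $\lambda = (\Delta - 1)\nu_r/2$. Hence

$$\frac{g_\Delta(U_{ri})}{g_1(U_{ri})} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{1}{k!} \left(\frac{\lambda \nu_r U_{ri}}{2}\right)^{k} \frac{\Gamma(\frac{1}{2}\nu_r)}{\Gamma(\frac{1}{2}\nu_r + k)} \tag{4.1}$$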


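where $\lambda = (\Delta - 1)\nu_r/2$. Let

$$a_k = \frac{e^{-\lambda}}{k!} \left(\frac{\lambda \nu_r U_{ri}}{2}\right)^{k} \frac{\Gamma(\frac{1}{2}\nu_r)}{\Gamma(\frac{1}{2}\nu_r + k)}, \qquad k = 0, 1, 2, \ldots.$$

Since

$$\frac{a_{k+1}}{a_k} = \frac{\lambda \nu_r U_{ri}}{2} \cdot \frac{1}{(k+1)(\frac{1}{2}\nu_r + k)} \to 0 \quad \text{as } k \to \infty,$$

then for any $0 < \delta < 1$, there exists $q$ such that

$$\frac{a_{k+1}}{a_k} < \delta < 1, \quad \text{for all } k \ge q.$$

Let us consider the error due to the truncation of the series
in (4.1). Let $q$ be the number of terms in the truncated series.
Then the error due to truncation of the series in (4.1) is given
by

$$\sum_{k=q}^{\infty} a_k = \sum_{k=0}^{\infty} a_{q+k}.$$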






Given $\eta > 0$, let $k_0$ be the smallest positive integer $k$ such
that

$$\frac{a_k}{\eta} < 1 \quad \text{and} \quad \frac{a_{k+1}}{a_k} + \frac{a_k}{\eta} < 1.$$

For this $k_0$, it is easy to prove that

$$0 < \frac{g_\Delta(U_{ri})}{g_1(U_{ri})} - \sum_{k=0}^{k_0-1} a_k = \sum_{k=0}^{\infty} a_{k_0+k} < \eta.$$

Thus $g_\Delta(U_{ri})/g_1(U_{ri}) \approx \sum_{k=0}^{k_0-1} a_k$ with
error less than $\eta$. Evaluating $g_\Delta(U_{ri})/g_1(U_{ri})$ in
this way is computationally very efficient.
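The truncation rule above can be sketched as follows. The code assumes the series terms $a_k$ include the factor $e^{-\lambda}$, so that the likelihood ratio equals $\sum_k a_k$, and it builds each term from the previous one through the ratio $a_{k+1}/a_k = (\lambda \nu_r U_{ri}/2)/[(k+1)(\tfrac{1}{2}\nu_r + k)]$. The function name `likelihood_ratio` and the safeguard `max_terms` are hypothetical additions for this illustration.

```python
import math


def likelihood_ratio(u, nu, delta, eta=0.1, max_terms=10_000):
    """Approximate g_Delta(u) / g_1(u) by the truncated series.

    Returns (partial_sum, k0), where partial_sum = a_0 + ... + a_{k0-1}
    and the neglected tail is smaller than eta.
    """
    lam = (delta - 1.0) * nu / 2.0     # non-centrality when sigma^2 = Delta
    c = lam * nu * u / 2.0
    a_k = math.exp(-lam)               # a_0
    total = 0.0
    for k in range(max_terms):
        ratio = c / ((k + 1) * (0.5 * nu + k))   # a_{k+1} / a_k
        # Smallest k >= 1 with a_{k+1}/a_k + a_k/eta < 1; since the ratio
        # is nonnegative, this also implies a_k/eta < 1.
        if k >= 1 and a_k / eta + ratio < 1.0:
            return total, k
        total += a_k
        a_k *= ratio
    raise RuntimeError("stopping rule not met within max_terms")
```

Because the ratio $a_{k+1}/a_k$ decreases in $k$, the tail beyond $k_0$ is dominated by a geometric series, which is why the stopping condition keeps the error below $\eta$.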


5. EXAMPLE 

In this section we present an example which will serve to
illustrate the sequential subset selection procedure. The data
set is taken from Neter and Wasserman (1974, p. 373), who used it
to illustrate several methods of finding a "best" set of independent
variables.

There are $n = 55$ observations on $p = 5$ independent variables.
Then $k = 2^4 - 2 = 14$. For the subset selection procedure $R$, two
constants $\Delta$ and $P^*$ with $\Delta > 1$, $1 > P^* > 0$, are
specified and we wish to select a subset containing all superior
regression equations with probability at least $P^*$.

Begin by taking $n_1$ $(\ge 5)$ independent observations. Calculate
the values of the $k$ ratios $g_\Delta(U_{ri})/g_1(U_{ri})$ with error
$\eta$ (specified). For any $r$, $i$, if

$$g_\Delta(U_{ri})/g_1(U_{ri}) \ge b,$$

where $b = k(k+1)/[2(1-P^*)]$, we eliminate the regression equation
from further consideration. We proceed to the second stage by
taking $n_2 - n_1$ independent observations on each of the remaining
regression equations. The ratios are again computed and the same
elimination rule is used. We continue in this manner until the
elimination is stopped, at which time the procedure is terminated
with the declaration that the remaining regression equations are
the superior regression equations.

Let $\eta = 0.1$. For the value of $g_\Delta(U_{ri})/g_1(U_{ri})$, this
error of $\eta = 0.1$ is small enough relative to the constant $b$.
Tables II-VII list the subsets of independent variables eliminated by
the sequential subset selection procedure $R$.

In Table III, we consider $\Delta = 1.2$. If $P^* = 0.9$, then the
procedure $R$ eliminates $(X_1, X_3)$, $(X_1, X_4)$ and $(X_1, X_5)$ at
stage 1 ($n_1 = 11$), eliminates $(X_1, X_4, X_5)$ at stage 2
($n_2 = 16$) and eliminates $(X_1, X_3, X_4)$ at stage 3 ($n_3 = 21$).
No subset is eliminated at stage 4 ($n_4 = 26$). Thus the procedure is
terminated. $(X_1, X_2)$, $(X_1, X_2, X_3)$, $(X_1, X_2, X_4)$,
$(X_1, X_2, X_5)$, $(X_1, X_3, X_5)$, $(X_1, X_2, X_3, X_4)$,
$(X_1, X_2, X_3, X_5)$, $(X_1, X_2, X_4, X_5)$ and
$(X_1, X_3, X_4, X_5)$ are the sets of variables of the superior
regression equations. We can use the $C_p$ statistic to select one
good regression equation among the set of superior regression
equations. For this example,
$(X_1, X_2, X_4)$ is the set of variables of a good regression equation
(cf. Neter and Wasserman (1974)). Tables II-VII present the
results for $\Delta = 1.1$, 1.2, 1.5, 2, 3 and 5; $P^* = 0.7$, 0.8 and 0.9.

TABLE II 


$\eta = 0.1$, $\Delta = 1.1$.


p* 

n 

16 

21 

26 

31 

0.7 


(1.477(1,30 

(1,5) 

0,4,50, (1,3.4) 

no rejection 

—- 

0.8 


(TOO ,Ti ,3) 

71,4.50707X00 

0,4) 

no rejection 

— 

0.9 


0.50. (1,30 

.__ 

o ,4,5i,Ti .or 

0,3,4) 

no rejection 













TABLE III

$\eta = 0.1$, $\Delta = 1.2$.


11 


0 ,4) ,Jl,3) 
775770,4) 
1U3JL_. 



16 

21 

26 


no rejection 

— 


5) 

no rejection 








(1T475) 

7,3/4 T~ 

no rejection 


TABLE IV 


$\eta = 0.1$, $\Delta = 1.5$.


6 

11 

16 

(1,47,0,37 

TOOT 

(1,4,5T,0757 

0,3,4) 

no rejection 

71 , 4757,737 

0,3,4), (1,4) 

no rejection 

( 1 ,37 

0,4,57,7777 

0,3,4),(1,4) 

no rejection 

TABLE V

$\eta = 0.1$, $\Delta = 2$.

6 

11 

16 

71,4,5") , (1,57 
0,4), 1,3) 

0,3,77 

no rejection 

"0757,71,47 ■ 
p,3) 

( 1 , 4 , 57 , 773,47 

no rejection 

0,5770,4) 

(1,3) 

(1, 4,57,777,7 

no rejection 














TABLE VI 


$\eta = 0.1$, $\Delta = 3$.


n 

P* 

6 

11 

16 

0:7 

( 1»4,5)", (1 , 5^TH , 3,4") 
CM), 1 , 3 ) 

no rejection 


0.8*. 

(1,4,5 i,n.5i 

1L.4), 1,3). ... . 


no rejection 

0.9. 

(t!4,5T,tl,5) 

(1,4),(1,3) 

(173,4T 

no rejection 


TABLE VII 


$\eta = 0.1$, $\Delta = 5$.